Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals

Neural Information Processing Systems

Neural sequence-to-sequence models are well established for applications which can be cast as mapping a single input sequence into a single output sequence. In this work, we focus on one-to-many sequence transduction problems, such as extracting multiple sequential sources from a mixture sequence. We extend the standard sequence-to-sequence model to a conditional multi-sequence model, which explicitly models the relevance between multiple output sequences with the probabilistic chain rule. Based on this extension, our model can conditionally infer output sequences one-by-one by making use of both the input and previously-estimated contextual output sequences. This model additionally has a simple and efficient stop criterion for the end of the transduction, enabling it to infer a variable number of output sequences. We take speech data as the primary test field for evaluating our methods, since observed speech data are often composed of multiple sources due to the superposition principle of sound waves. Experiments on several different tasks including speech separation and multi-speaker speech recognition show that our conditional multi-sequence models lead to consistent improvements over the conventional non-conditional models.
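
For concreteness, the decoding loop such a conditional chain model implies can be sketched as follows. This is a minimal, hypothetical PyTorch sketch, where the encode/step/stop/decode/embed interface is an illustrative assumption, not the paper's actual code:

```python
# Hypothetical interface: model.encode/step/stop/decode/embed are assumed
# components, not the authors' implementation.
import torch

def conditional_chain_infer(model, mixture, max_sources=5):
    """Infer a variable number of output sequences from one input mixture."""
    enc = model.encode(mixture)             # shared encoding of the input
    prev = torch.zeros_like(enc)            # "empty" condition for the first source
    outputs = []
    for _ in range(max_sources):
        hidden = model.step(enc, prev)      # condition on input + previous output
        if model.stop(hidden):              # learned stop criterion ends the chain
            break
        y = model.decode(hidden)            # one separated source / transcript
        outputs.append(y)
        prev = model.embed(y)               # feed the estimate to the next step
    return outputs
```

Each iteration realizes one factor of the chain rule p(Y1, ..., YN | X) = Π_n p(Yn | Y1, ..., Yn-1, X), with the previous estimate standing in for the exact conditioning sequences.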


Review for NeurIPS paper: Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals

Neural Information Processing Systems

Weaknesses: Generality: The idea is not as general as presented. It is quite similar to techniques like multisource decoding [1], where the decoder of a network is conditioned on multiple input sequences, and deliberation models [2]. Furthermore, if one uses attention-based models that directly model p(Y|X), conditioning on multiple sequences is quite straightforward and not uncommon. For example, if there are multiple sequences Y1, Y2, then by concatenating Y1 and Y2 one can model p(Y1, Y2|X) = p(Y1|X) p(Y2|Y1, X), which naturally conditions on both the input and prior sequences (see, e.g., [4]). The novelty lies mostly in how this is applied to multi-talker separation tasks, or specifically, multiple tasks from the same domain where the order of the outputs doesn't matter much.


Review for NeurIPS paper: Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals

Neural Information Processing Systems

All reviewers agree that the paper is an interesting contribution (a conditional chain model) to an important problem (multi-sequence transduction, with application to ASR). There were concerns that the experimental section was on the weak side, as well as about some unclear points. However, the reviewers found the rebuttal convincing enough and raised their scores accordingly.


UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures

Neural Information Processing Systems

In reverberant conditions with multiple concurrent speakers, each microphone acquires a mixture signal of multiple speakers at a different location. In over-determined conditions where the microphones outnumber the speakers, we can narrow down the solutions to speaker images and realize unsupervised speech separation by leveraging each mixture signal as a constraint (i.e., the estimated speaker images at a microphone should add up to the mixture). At each training step, we feed an input mixture to a deep neural network (DNN) to produce an intermediate estimate for each speaker, linearly filter the estimates, and optimize a loss so that, at each microphone, the filtered estimates of all the speakers add up to the mixture, satisfying the above constraint. We show that this loss can promote unsupervised separation of speakers. The linear filters are computed in each sub-band from the mixture and the DNN estimates through the forward convolutive prediction (FCP) algorithm.
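
As a concrete illustration of the mixture constraint, below is a minimal NumPy sketch, assuming complex STFT inputs, causal FCP filters with K past taps, and a closed-form per-frequency least-squares solve. The shapes, names, and regularization are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def stack_taps(est, K):
    """Stack K past frames of a (F, T) STFT into shape (F, T, K)."""
    F, T = est.shape
    out = np.zeros((F, T, K), dtype=est.dtype)
    for k in range(K):
        out[:, k:, k] = est[:, : T - k]
    return out

def fcp_filter(mix, est, K, eps=1e-6):
    """Closed-form per-frequency filter g minimizing ||mix - g^H stacked(est)||^2."""
    X = stack_taps(est, K)                      # (F, T, K)
    # Normal equations per frequency: (X^H X) g = X^H y
    A = np.einsum('ftk,ftl->fkl', X.conj(), X)  # (F, K, K)
    b = np.einsum('ftk,ft->fk', X.conj(), mix)  # (F, K)
    A += eps * np.eye(K)[None]                  # Tikhonov regularization
    g = np.linalg.solve(A, b)                   # (F, K)
    return np.einsum('ftk,fk->ft', X, g)        # filtered estimate (F, T)

def unssor_loss(mixes, ests, K=20):
    """mixes: list of (F, T) mic STFTs; ests: list of (F, T) speaker estimates."""
    loss = 0.0
    for mix in mixes:                            # each mic gives one constraint
        filtered = sum(fcp_filter(mix, est, K) for est in ests)
        loss += np.mean(np.abs(mix - filtered) ** 2)
    return loss / len(mixes)
```

In actual training the loss would be backpropagated through the DNN estimates; the sketch only shows how the FCP-filtered estimates are constrained to add up to each microphone's mixture.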


Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation

Bando, Yoshiaki, Masuyama, Yoshiki, Nugraha, Aditya Arie, Yoshii, Kazuyoshi

arXiv.org Artificial Intelligence

This paper describes an efficient unsupervised learning method for a neural source separation model that utilizes a probabilistic generative model of observed multichannel mixtures proposed for blind source separation (BSS). For this purpose, amortized variational inference (AVI) has been used for directly solving the inverse problem of BSS with full-rank spatial covariance analysis (FCA). Although this unsupervised technique, called neural FCA, is in principle free from the domain-mismatch problem, it is computationally demanding because the spatial model is full-rank, a property adopted in exchange for robustness against relatively short reverberations. To reduce the model complexity without sacrificing performance, we propose neural FastFCA based on a jointly-diagonalizable yet full-rank spatial model. Our neural separation model, introduced for AVI, alternates between neural network blocks and single steps of an efficient iterative algorithm called iterative source steering. This alternating architecture enables the separation model to quickly separate the mixture spectrogram by leveraging both the deep neural network and the multichannel optimization algorithm. The training objective with AVI is derived to maximize the marginalized likelihood of the observed mixtures. The experiment using mixture signals of two to four sound sources shows that neural FastFCA outperforms conventional BSS methods and reduces the computational time to about 2% of that of neural FCA.
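
To give a flavor of the optimization half of this alternating architecture, here is a hedged NumPy sketch of one generic AuxIVA-style iterative source steering (ISS) sweep under a simple time-varying Gaussian source model. The shapes and variance model are assumptions, and the paper's jointly-diagonalizable full-rank spatial model adds structure beyond this sketch:

```python
import numpy as np

def iss_step(Y, eps=1e-8):
    """One ISS sweep. Y: demixed STFTs, shape (N_src, F, T), complex."""
    N, F, T = Y.shape
    for k in range(N):
        # Per-source, per-frame variances (simple time-varying Gaussian model).
        r = np.mean(np.abs(Y) ** 2, axis=1) + eps        # (N, T)
        Yk = Y[k]                                         # (F, T)
        num = np.einsum('nft,ft->nf', Y / r[:, None, :], Yk.conj())
        den = np.einsum('nt,ft->nf', 1.0 / r, np.abs(Yk) ** 2) + eps
        v = num / den                                     # (N, F) steering gains
        # Self-update keeps the scale of source k consistent.
        v[k] = 1.0 - np.sqrt(T / den[k])
        Y = Y - v[:, :, None] * Yk[None]                  # rank-1 steering update
    return Y
```

In the paper's architecture, passes like this are interleaved with neural network blocks, so the network proposes source statistics and the multichannel algorithm refines the separation.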


Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection

Cornell, Samuele, Balestri, Thomas, Sénéchal, Thibaud

arXiv.org Artificial Intelligence

When such device-playback voices are increasingly realistic, or include celebrity-derived custom voices, the performance of tasks such as keyword spotting (KWS) and device-directed speech detection (DDD) can degrade significantly: the device can "self-wake" and continuously interrupt itself, as the model alone cannot implicitly distinguish between user and device speech and ignore the latter. This problem also affects automatic speech recognition (ASR) and keyword-less initiated interactions, such as device-directed detection (DDD) [7-10]. One trivial way to mitigate this issue would be to disable the KWS functionality while the device is in playback; yet doing so prevents the user from "barging in", making the interaction significantly less natural. To address this problem, we propose an implicit acoustic echo cancellation (iAEC) framework where a neural network is trained to exploit the additional information from a reference microphone channel to learn to ignore the interfering signal and improve detection performance. We study this framework for the tasks of KWS and DDD on, respectively, an augmented version of Google Speech Commands v2 and a real-world Alexa device dataset.
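
The core architectural idea, as described, is to feed the detector the reference (playback) channel alongside the microphone channel and let it learn to ignore the echo. A minimal PyTorch sketch under assumed feature shapes and layer sizes (illustrative, not the paper's model):

```python
import torch
import torch.nn as nn

class IAECKeywordSpotter(nn.Module):
    """KWS classifier that also sees the device playback (reference) features."""
    def __init__(self, n_feats=40, n_classes=12):
        super().__init__()
        self.rnn = nn.GRU(2 * n_feats, 128, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, mic_feats, ref_feats):
        # mic_feats, ref_feats: (B, T, n_feats) each. Concatenating gives the
        # network the interfering signal so it can implicitly cancel it.
        x = torch.cat([mic_feats, ref_feats], dim=-1)
        _, h = self.rnn(x)                      # summarize the utterance
        return self.head(h[-1])                 # keyword logits
```

The "implicit" in iAEC refers to this design choice: no explicit echo-cancelled signal is produced; the detector is simply trained end-to-end with the reference channel as an extra input.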


Deep Bayesian Unsupervised Source Separation Based on a Complex Gaussian Mixture Model

Bando, Yoshiaki, Sasaki, Yoko, Yoshii, Kazuyoshi

arXiv.org Machine Learning

This paper presents an unsupervised method that trains neural source separation by using only multichannel mixture signals. Conventional neural separation methods require large amounts of supervised data to achieve excellent performance. Although multichannel methods based on spatial information can work without such training data, they are often sensitive to parameter initialization and degrade when the sources are located close to each other. The proposed method uses a cost function based on a spatial model called a complex Gaussian mixture model (cGMM). This model has the time-frequency (TF) masks and directions of arrival (DoAs) of the sources as latent variables and is used to train separation and localization networks that respectively estimate these variables. This joint training solves the frequency permutation ambiguity of the spatial model in a unified deep Bayesian framework. In addition, the pre-trained network can be used not only for performing monaural separation but also for efficiently initializing a multichannel separation algorithm. Experimental results with simulated speech mixtures showed that our method outperformed a conventional initialization method.
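
The cGMM cost that the networks are trained against can be sketched as a mixture likelihood over sources at each time-frequency bin. In the hedged NumPy sketch below, the network-estimated masks act as mixture weights, and the per-bin source power is folded into the spatial covariances for simplicity (both are simplifying assumptions, not the authors' exact formulation):

```python
import numpy as np

def cgmm_neg_log_likelihood(X, masks, H, eps=1e-8):
    """X: (F, T, M) multichannel STFT; masks: (K, F, T); H: (K, F, M, M)."""
    F, T, M = X.shape
    K = masks.shape[0]
    log_p = np.empty((K, F, T))
    for k in range(K):
        Hk = H[k] + eps * np.eye(M)                            # regularized covariance
        Hinv = np.linalg.inv(Hk)                               # (F, M, M)
        # Per-bin complex Gaussian log-density: -x^H H^{-1} x - log|H| - M log(pi)
        quad = np.einsum('ftm,fmn,ftn->ft', X.conj(), Hinv, X).real
        _, logdet = np.linalg.slogdet(Hk)                      # (F,)
        log_p[k] = -quad - logdet[:, None] - M * np.log(np.pi)
    # Log-sum-exp over sources, weighted by the estimated TF masks.
    m = log_p.max(axis=0)
    mix_ll = m + np.log((masks * np.exp(log_p - m)).sum(axis=0) + eps)
    return -mix_ll.mean()
```

Minimizing such a cost with respect to the mask and DoA networks is what lets the spatial model supervise the neural separator without any clean reference signals.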


A variance modeling framework based on variational autoencoders for speech enhancement

Leglaive, Simon, Girin, Laurent, Horaud, Radu

arXiv.org Machine Learning

In this paper we address the problem of enhancing speech signals in noisy mixtures using a source separation approach. We explore the use of neural networks as an alternative to a popular speech variance model based on supervised nonnegative matrix factorization (NMF). More precisely, we use a variational autoencoder as a speaker-independent supervised generative speech model, highlighting the conceptual similarities that this approach shares with its NMF-based counterpart. In order to be free of generalization issues regarding the noisy recording environments, we follow the approach of having a supervised model only for the target speech signal, the noise model being based on unsupervised NMF. We develop a Monte Carlo expectation-maximization algorithm for inferring the latent variables in the variational autoencoder and estimating the unsupervised model parameters. Experiments show that the proposed method outperforms a semi-supervised NMF baseline and a state-of-the-art fully supervised deep learning approach.
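
The variance model at the heart of this approach combines a VAE-decoded speech variance with an NMF noise variance under a zero-mean complex Gaussian observation model. The sketch below (assumed notation, not the authors' code) shows the resulting negative log-likelihood that the Monte Carlo EM procedure works with:

```python
import numpy as np

def neg_log_likelihood(x, var_speech, W, H, eps=1e-8):
    """x: mixture STFT (F, T), complex; var_speech: VAE decoder output (F, T);
    W: (F, K) and H: (K, T): unsupervised NMF noise model."""
    var_noise = W @ H                          # (F, T), nonnegative
    var = var_speech + var_noise + eps         # additive variance model
    # Complex Gaussian NLL up to a constant: log(var) + |x|^2 / var per bin
    return np.sum(np.log(var) + np.abs(x) ** 2 / var)
```

The additivity of the two variance terms is what makes the comparison with semi-supervised NMF natural: only the speech variance model is swapped from an NMF dictionary to a VAE decoder.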